Search for: All records

Creators/Authors contains: "Mandel, Michael"


  1. Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, such as vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created benchmark for audio-based semantic reasoning focused on synonym and hypernym recognition. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning with interleaved training prompts improves the results further, though at the expense of the MLLM's audio labeling ability.
    Free, publicly-accessible full text available December 7, 2026
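The interleaved prompting described in item 1 can be illustrated with a minimal sketch. The helper names and the `<audio>` placeholder below are assumptions for illustration, not the LTU model's actual tokenization or API; the point is only the difference between prepending audio tokens and placing them inside the instruction text.

```python
# Minimal sketch of concatenated vs. interleaved prompting.
# AUDIO_PLACEHOLDER stands in for the span of audio tokens produced by the
# audio encoder; in a real MLLM these would be embedding vectors, not text.
AUDIO_PLACEHOLDER = "<audio>"

def concatenated_prompt(instruction: str) -> str:
    """Standard MLLM prompting: audio tokens prepended, text instruction appended."""
    return f"{AUDIO_PLACEHOLDER} {instruction}"

def interleaved_prompt(instruction_template: str) -> str:
    """Interleaved prompting: audio tokens appear inside the instruction text."""
    return instruction_template.format(audio=AUDIO_PLACEHOLDER)

# Illustrative wording only, not an actual SHARD item.
print(concatenated_prompt("Is this sound a synonym of 'barking'?"))
print(interleaved_prompt("Is the sound {audio} a synonym of 'barking'?"))
```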
  2. Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images; such multimodal LLMs (MLLMs) are capable of describing images or sound recordings. Previous work has demonstrated that when the LLM component of an MLLM is frozen, the audio or visual encoder serves to caption the sound or image input, facilitating text-based reasoning with the LLM component. We are interested in using the LLM's reasoning capabilities to facilitate classification. In this paper, we demonstrate through a captioning/classification experiment that an audio MLLM cannot fully leverage its LLM's text-based reasoning when generating audio captions. We also consider how this may be because the MLLM represents auditory and textual information separately, severing the reasoning pathway from the LLM to the audio encoder.
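The caption-then-classify setup probed in item 2 can be sketched as a two-stage pipeline. The functions below are hypothetical stand-ins, not the paper's implementation: the frozen LLM only ever sees the caption text, so any acoustic detail the caption omits is unavailable to its reasoning.

```python
# Hypothetical two-stage probe: the audio encoder produces a caption, and a
# frozen LLM then classifies from that caption alone.

def audio_to_caption(audio_embedding) -> str:
    """Stand-in for the audio encoder + projection producing a text caption."""
    return "a dog barking in the distance"  # illustrative output only

def llm_classify(caption: str, labels: list[str]) -> str:
    """Stand-in for prompting the frozen LLM to choose a label given the caption."""
    # Here we simply pick the first label mentioned in the caption, if any.
    return next((lab for lab in labels if lab in caption), labels[0])

labels = ["dog", "siren", "speech"]
print(llm_classify(audio_to_caption(None), labels))  # -> "dog"
```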
  3. Data valuation in machine learning is concerned with quantifying the relative contribution of a training example to a model's performance. Quantifying the importance of training examples is useful for identifying high- and low-quality data, curating training datasets, and addressing data quality issues. Shapley values have gained traction in machine learning for these purposes. While computing exact Shapley values of training examples is computationally prohibitive, approximation methods have been used successfully for classification models in computer vision tasks. We investigate data valuation for Automatic Speech Recognition (ASR) models, which perform a structured prediction task, and propose a method for estimating Shapley values for these models. We show that a proxy model can be learned for the acoustic model component of an end-to-end ASR system and used to estimate Shapley values for acoustic frames. We present a method for using the proxy acoustic model to estimate Shapley values for variable-length utterances and demonstrate that the Shapley values provide a signal of example quality.
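Item 3 relies on approximating Shapley values because exact computation requires evaluating every subset of the training data. As a rough illustration of the kind of approximation involved, the sketch below uses Monte Carlo permutation sampling with a generic `utility` function; the paper instead learns a proxy acoustic model, which is not reproduced here.

```python
import random

def shapley_monte_carlo(examples, utility, num_permutations=100):
    """Monte Carlo estimate of each training example's Shapley value.

    `utility(subset)` returns the performance obtained from that subset
    (in the paper's setting this would be supplied by a cheap proxy for the
    acoustic model rather than by retraining the full ASR system).
    """
    values = {i: 0.0 for i in range(len(examples))}
    for _ in range(num_permutations):
        order = list(range(len(examples)))
        random.shuffle(order)
        prev_utility = utility([])
        subset = []
        for idx in order:
            subset.append(examples[idx])
            new_utility = utility(subset)
            values[idx] += new_utility - prev_utility  # marginal contribution
            prev_utility = new_utility
    return {i: v / num_permutations for i, v in values.items()}

# Toy usage: the utility of a subset is the number of distinct labels it covers.
data = [("a", 0), ("b", 0), ("c", 1)]
print(shapley_monte_carlo(data, lambda s: len({y for _, y in s}), 200))
```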
  4. We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech but not to important regions. Importance is predicted for each utterance by a data augmentation agent trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation, which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms both the conventional noise augmentation and the baseline on two test sets with additional noise added.
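The core idea of ImportantAug in item 4 is to concentrate added noise in regions judged unimportant. The sketch below is a simplified illustration under stated assumptions: the importance mask is taken as a given per-sample array in [0, 1] (in the paper it is produced by a learned augmentation agent), and the noise is simply scaled by one minus the importance.

```python
import numpy as np

def important_aug(speech: np.ndarray, importance: np.ndarray,
                  noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix noise into a waveform while attenuating it where importance is high.

    Assumes `noise` and `importance` are at least as long as `speech`;
    `importance` holds per-sample values in [0, 1].
    """
    noise = noise[:len(speech)]
    # Scale the noise so the unmasked mixture would hit the requested SNR.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    # Suppress noise in important regions, keep it in unimportant ones.
    return speech + gain * (1.0 - importance[:len(speech)]) * noise

# Toy usage with a threshold-based stand-in for the learned importance mask.
rng = np.random.default_rng(0)
speech, noise = rng.standard_normal(16000), rng.standard_normal(16000)
importance = (np.abs(speech) > 1.0).astype(float)
augmented = important_aug(speech, importance, noise, snr_db=10.0)
```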
  5. We are sharing the Ecoacoustic Dataset from Arctic North Slope Alaska (EDANSA-2019), a dataset of audio samples collected over an area of 9,000 square miles on the North Slope of Alaska and neighboring regions throughout the 2019 summer season. There are over 27 hours of data labeled according to 28 tags, with enough instances of 9 important environmental classes to train baseline convolutional recognizers. Please see the following GitHub page for the accompanying publication, updates about the dataset, and baseline code: https://github.com/speechLabBcCuny/EDANSA-2019
  6. The Arctic is warming at three times the global average rate, affecting the habitat and life cycles of migratory species that reproduce there, such as birds and caribou. Ecoacoustic monitoring can help efficiently track changes in animal phenology and behavior over large areas so that the impacts of climate change on these species can be better understood and potentially mitigated. We introduce here the Ecoacoustic Dataset from Arctic North Slope Alaska (EDANSA-2019), a dataset collected by a network of 100 autonomous recording units covering an area of 9,000 square miles over the course of the 2019 summer season on the North Slope of Alaska and neighboring regions. We labeled over 27 hours of this dataset according to 28 tags, with enough instances of 9 important environmental classes to train baseline convolutional recognizers. We are releasing this dataset and the corresponding baseline to the community to accelerate the recognition of these sounds and facilitate automated analyses of large-scale ecoacoustic databases.
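Items 5 and 6 mention baseline convolutional recognizers trained on the labeled EDANSA-2019 audio. The model below is only an illustrative multi-label CNN over log-mel spectrograms, not the architecture released in the linked repository.

```python
import torch
import torch.nn as nn

class BaselineRecognizer(nn.Module):
    """Small CNN over log-mel spectrograms with multi-label (sigmoid) outputs."""

    def __init__(self, num_classes: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # pool over both frequency and time
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):  # x: (batch, 1, mel_bins, frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)  # logits; apply sigmoid for per-class probabilities

# Example: one 128-bin, 500-frame spectrogram.
logits = BaselineRecognizer()(torch.randn(1, 1, 128, 500))
```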
  7. Arctic boreal forests are warming at a rate 2–3 times faster than the global average. It is important to understand the effects of this warming on the activities of animals that migrate to these environments annually to reproduce. Acoustic sensors can monitor a wide area relatively cheaply, producing large amounts of data that need to be analyzed automatically. In such scenarios, only a small proportion of the recorded data can be labeled by hand, so we explore two methods for utilizing labels more efficiently: self-supervised learning using wav2vec 2.0 and data valuation using k-nearest-neighbor approximations to compute Shapley values. We confirm that data augmentation and global temporal pooling improve performance by more than 30%, demonstrate for the first time the utility of Shapley data valuation for audio classification, and find that our wav2vec 2.0 model trained from scratch does not improve performance.
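Item 7 credits global temporal pooling with a large share of the improvement. One common form of such pooling, concatenating the mean and max of frame-level features over time, is sketched below; the paper's exact pooling operation and feature extractor (e.g., the wav2vec 2.0 encoder) are not reproduced here.

```python
import numpy as np

def global_temporal_pool(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a (time, features) sequence into one fixed-length clip vector.

    Concatenating the per-dimension mean and max over time is one common
    choice; the paper's exact pooling may differ.
    """
    return np.concatenate([frame_features.mean(axis=0), frame_features.max(axis=0)])

# Example: 300 frames of 768-dimensional features -> a 1536-dimensional vector.
pooled = global_temporal_pool(np.random.randn(300, 768))
```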